About the project

This course covers the core themes of data science, such as visualising, analysing and interpreting data. It is for everyone who wants to become a better data scientist! I also expect that this course will improve my coding skills. My GitHub repository: https://github.com/Pedusal/IODS-project


Week 2

I have built a model that explores students' exam scores. I found that attitude was the most significant variable in explaining the differences in exam scores. I have also learned how to visualize data and how several useful plots can be used to study model validity.

This dataset consists of answers to a survey conducted on the course Introduction to Social Statistics in fall 2014. The survey questions concerned students' learning approaches and were divided into three categories: deep learning, surface learning and strategic learning. In the dataset the answers in these categories have been combined into the columns deep, surf and stra. The dataset also includes the students' age, gender, attitude (a sum of 10 questions measuring attitude towards statistics) and exam points.

Structure and dimension of the dataset:

## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
## [1] 166   7

Graphical overview of the data and summaries of the variables in the data:


The graphical overview shows that the variables attitude, stra and surf are roughly normally distributed. The age distribution is concentrated on the left (right-skewed), while deep and points lean toward the right, although the points distribution also has a fairly fat tail on the left side.

The attitude variable correlates with points in both genders; the correlation is about 0.43. Age also correlates with points, but only among males. Among males, surf shows some correlation with attitude, deep and stra. All other variable pairs show little or no correlation.

I chose attitude, stra and surf as explanatory variables and fitted a regression model with exam points as the target variable.
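A model like this can be fitted with `lm()`; a minimal sketch (using `learn14`, the data frame name that appears in the model call in the summary):

```r
# Multiple regression: exam points explained by attitude and
# the strategic and surface learning scores
my_model <- lm(Points ~ Attitude + stra + surf, data = learn14)
summary(my_model)
```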

Summary of the fitted model:

## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The t-test and its p-value measure how likely it would be to observe an estimate at least this far from zero if the true coefficient were zero. If the p-value is very low, the true coefficient is most likely not zero and we can call the estimate statistically significant.
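As a concrete check, the reported p-value is the two-sided tail probability of the t-distribution. For the Attitude coefficient (t ≈ 5.913 on 162 residual degrees of freedom) it can be reproduced with `pt()`:

```r
# Two-sided p-value for t = 5.913 with 162 degrees of freedom
2 * pt(-abs(5.913), df = 162)
# close to the 1.93e-08 reported in the summary
```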

In my model only attitude is statistically significant, so I ran the regression again using only that explanatory variable.

Summary of the new model:

## 
## Call:
## lm(formula = Points ~ Attitude, data = learn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The estimate for the attitude variable is ~0.35 and it is statistically significant. According to this model, one additional attitude point is associated with about 0.35 more exam points.

The multiple R-squared of this model is 0.19, which means that the model explains about 19% of the variation in the dependent variable (exam points). In other words, according to this model, attitude towards statistics explains roughly a fifth of the differences in the Introduction to Social Statistics exam results.

Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage:

The first assumption is that the errors of the model are normally distributed; the QQ-plot of the residuals shows that this assumption is reasonable. The constant variance assumption implies that the size of the errors should not depend on the explanatory variables, which can be checked with a scatter plot of residuals versus fitted values. This assumption also seems valid, since there is no clear pattern in the scatter plot. The third plot shows the leverage of the observations; leverage measures how much impact a single observation has on the model. Based on that plot, no single observation has excessive leverage on the model.
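The three diagnostic plots can be drawn directly from the fitted model object; a sketch (assuming the simple-regression model is stored in `my_model2`, a hypothetical name):

```r
# Base R diagnostics: 1 = Residuals vs Fitted, 2 = Normal Q-Q,
# 5 = Residuals vs Leverage
par(mfrow = c(1, 3))
plot(my_model2, which = c(1, 2, 5))
```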


Week 3

This week's dataset combines two datasets on student achievement in secondary education in two Portuguese schools. The datasets provide information about student performance in two different subjects, and both include the same information about the students' backgrounds.

Names of the variables in this week's dataset:

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

I chose to study the variables age, sex, absences and quality of family relationships (famrel) more closely, and their relationship with alcohol consumption. My hypothesis is that age and absences correlate positively with alcohol consumption, famrel correlates negatively, and male students use more alcohol than female students.

The distributions of the chosen variables:


None of my variables is normally distributed. The absences variable does not show any clear pattern. A typical student is between 15 and 18 years old and reports a good quality of family relationships. Gender is roughly evenly distributed among the students.

Older students seem to consume more alcohol than younger ones, as I expected.

Students with more absences seem to consume more alcohol than students who do not skip classes, as I expected.

There is less high alcohol use among those students who have a good quality of family relationships.

There are more male students among the high alcohol consumers, according to this graph.

All in all, these plots support my earlier hypothesis.

Summary of the fitted model:

## 
## Call:
## glm(formula = high_use ~ sex + age + famrel + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2270  -0.8362  -0.6109   1.0447   2.1507  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.77961    1.74949  -2.160   0.0307 *  
## sexM         1.03576    0.24431   4.240 2.24e-05 ***
## age          0.18794    0.10214   1.840   0.0658 .  
## famrel      -0.30293    0.12784  -2.370   0.0178 *  
## absences     0.08890    0.02272   3.913 9.10e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 421.43  on 377  degrees of freedom
## AIC: 431.43
## 
## Number of Fisher Scoring iterations: 4

All the variables are statistically significant, although age only at the 0.1 significance level. SexM and absences are statistically the most significant variables. Gender and the quality of family relationships seem to have quite a large impact on students' high use of alcohol.

The coefficients of the model as odds ratios and confidence intervals for them:

##                     OR        2.5 %    97.5 %
## (Intercept) 0.02283157 0.0007056798 0.6843933
## sexM        2.81725434 1.7567833577 4.5865176
## age         1.20676457 0.9891582297 1.4778200
## famrel      0.73865126 0.5737167939 0.9488277
## absences    1.09297279 1.0473815343 1.1452177

These results are in line with my earlier stated hypothesis.
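The odds ratios in the table above are the exponentiated logistic regression coefficients; a sketch assuming the fitted model object is called `m`:

```r
# Odds ratios from the logistic regression coefficients
OR <- exp(coef(m))
# Profile-likelihood confidence intervals, also exponentiated
CI <- exp(confint(m))
cbind(OR, CI)
```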

##     failures absences sex high_use probability prediction
## 373        1        0   M    FALSE   0.4050055      FALSE
## 374        1        7   M     TRUE   0.4370325      FALSE
## 375        0        1   F    FALSE   0.1391479      FALSE
## 376        0        6   F    FALSE   0.2544631      FALSE
## 377        1        2   F    FALSE   0.1757312      FALSE
## 378        0        2   F    FALSE   0.1930123      FALSE
## 379        2        2   F    FALSE   0.3047678      FALSE
## 380        0        3   F    FALSE   0.3934424      FALSE
## 381        0        4   M     TRUE   0.5500634       TRUE
## 382        0        2   M     TRUE   0.4025644      FALSE
##         prediction
## high_use FALSE TRUE
##    FALSE   256   12
##    TRUE     80   34

##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67015707 0.03141361 0.70157068
##    TRUE  0.20942408 0.08900524 0.29842932
##    Sum   0.87958115 0.12041885 1.00000000
## [1] 0.2408377
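The final number above (~0.24) is the training error, i.e. the share of misclassified observations; it can be computed like this (assuming columns `high_use` and `prediction` in the data frame `alc`):

```r
# Mean prediction error: proportion of wrong classifications
mean(alc$high_use != alc$prediction)
# (12 + 80) / 382 = 0.2408377
```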

Week 4 - Clustering and classification

Description of the dataset

The dataset is part of the R package “MASS” and it contains housing values in suburbs of Boston. More details about the variables can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Structure of the data:

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Dimensions of the data:

## [1] 506  14

Graphical overview of the data


The variables rad and tax have the highest correlation (0.91). Other variable pairs with high correlation (absolute value greater than 0.7) are “indus,nox”, “indus,tax”, “nox,age”, “rm,medv”, “indus,dis”, “nox,dis”, “age,dis” and “lstat,medv”. The distributions of crim, dis and lstat are clearly concentrated on the left (right-skewed). rm and medv are roughly normally distributed, while age and black are concentrated on the right (left-skewed).
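These pairwise correlations come straight from the correlation matrix; a minimal sketch:

```r
library(MASS)  # provides the Boston dataset
# Correlation matrix of all 14 variables, rounded for readability
cor_matrix <- round(cor(Boston), 2)
cor_matrix["rad", "tax"]  # the highest pairwise correlation, 0.91
```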

Summaries of the variables in the data:

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Scaling the data

Summaries of the variables in the scaled data:

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

All variables now have zero mean; with standard scaling each also has unit standard deviation.
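The standardization subtracts each column's mean and divides by its standard deviation, which is exactly what `scale()` does by default:

```r
library(MASS)
# Standardize every column: (x - mean(x)) / sd(x)
boston_scaled <- as.data.frame(scale(Boston))
colMeans(boston_scaled)  # all means are now (numerically) zero
```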

LDA

Summary of the LDA model and the (bi)plot:

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2500000 0.2400990 0.2524752 
## 
## Group means:
##                  zn       indus        chas        nox          rm
## low       0.9444027 -0.89818627 -0.12090214 -0.8777896  0.46730537
## med_low  -0.0930677 -0.35655390 -0.07742312 -0.5922344 -0.11443880
## med_high -0.3882249  0.08086577  0.17414622  0.2978238  0.07805081
## high     -0.4872402  1.01710965 -0.11793298  1.0864317 -0.37358515
##                 age        dis        rad        tax    ptratio      black
## low      -0.8795921  0.8399873 -0.6826072 -0.7453656 -0.4280419  0.3791846
## med_low  -0.4221774  0.4117277 -0.5384037 -0.5070395 -0.0489752  0.3155679
## med_high  0.3608703 -0.3086763 -0.4431573 -0.3577176 -0.2108896  0.1047946
## high      0.8085376 -0.8459554  1.6382099  1.5141140  0.7808718 -0.8491276
##                 lstat        medv
## low      -0.754222303  0.55294176
## med_low  -0.174238992  0.03335807
## med_high  0.005807658  0.14820643
## high      0.838985979 -0.68643330
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08006000  0.69305966 -0.97550180
## indus    0.11442784 -0.16891135  0.31275943
## chas    -0.12928575 -0.10570776  0.05060807
## nox      0.39905538 -0.78521198 -1.28526077
## rm      -0.16009149 -0.06302664 -0.17573362
## age      0.13537099 -0.39072686 -0.27164268
## dis     -0.03887342 -0.30622752  0.22228630
## rad      3.66530248  1.09408405  0.02741219
## tax      0.05086760 -0.13931149  0.52651305
## ptratio  0.10470326 -0.03579711 -0.22860249
## black   -0.12236647  0.04573984  0.15173691
## lstat    0.17713001 -0.18804448  0.39326041
## medv     0.20745449 -0.35344309 -0.12003974
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9630 0.0277 0.0094

The variable rad is clearly the most influential linear separator for the classes: its LD1 coefficient (3.67) is an order of magnitude larger than the others, and LD1 alone captures about 96% of the trace.

Predicting with the LDA model

Cross table of the results with the crime categories from the test set:

##           predicted
## correct    low med_low med_high high
##   low       17       5        1    0
##   med_low    5      11        9    0
##   med_high   0       3       23    3
##   high       0       0        0   25

The model seems to work quite well: 76 of the 102 test observations (about 75%) are classified correctly, and the high category is predicted perfectly.
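A sketch of how this cross table is produced (assuming the fitted model `lda.fit`, the test data `test`, and its true classes saved in `correct_classes`):

```r
# Predict the crime classes of the test observations
lda.pred <- predict(lda.fit, newdata = test)
# Cross-tabulate the true classes against the predictions
table(correct = correct_classes, predicted = lda.pred$class)
```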

Clustering the (scaled) dataset

In this part I calculate the distances between the observations, run the k-means algorithm on the data, investigate the optimal number of clusters, and then run the algorithm again.

Based on the total within-cluster sum of squares plot above, I would say that 2 is the optimal number of clusters.
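The elbow search can be sketched as follows (assuming the standardized data is in `boston_scaled`):

```r
set.seed(123)  # k-means uses random starting centers
# Total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "k", ylab = "Total WCSS")
# Re-run with the k chosen at the elbow of the curve
km <- kmeans(boston_scaled, centers = 2)
```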

Visualization of the clusters:

Bonus

The variables age and rad are the most influential linear separators for the clusters.

Super-Bonus

## [1] 404  13
## [1] 13  3

Week 5 - Dimensionality reduction techniques

Description of the dataset

Summaries of the variables in the data:

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The distributions of Ado.Birth, GNI, Parli.F and Mat.Mor are clearly concentrated on the left (right-skewed). Edu.Exp is roughly normally distributed, while Life.Exp and Labo.FM are concentrated on the right (left-skewed).

Correlations between variables:

The variables Life.Exp and Mat.Mor have the highest correlation in absolute value (-0.86). Other variable pairs with high correlation (absolute value greater than 0.7) are “Mat.Mor,Edu.Exp”, “Edu.Exp,Life.Exp” and “Mat.Mor,Ado.Birth”.

Principal component analysis (PCA) on the non-standardized data

PC1 captures almost 100% of the variance in the data. This happens because PCA on unstandardized data is dominated by the variable with the largest variance, here GNI, whose scale is orders of magnitude larger than that of the other variables.

## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

PCA on the standardized data

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
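The contrast between the two tables above can be reproduced with `prcomp()`; a sketch (assuming the data frame is called `human`):

```r
# PCA on raw data: dominated by the huge-variance GNI column
summary(prcomp(human))
# PCA on standardized data: each variable contributes equally
summary(prcomp(human, scale. = TRUE))
```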

Tea time!

The tea dataset is from the FactoMineR package.

Visualizing the dataset:

##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300   6
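The MCA itself is a single call to FactoMineR's `MCA()`; a sketch that builds the six-column `tea_time` data used here:

```r
library(FactoMineR)
data(tea)
# Keep the six categorical columns studied here
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
# Multiple correspondence analysis without automatic plotting
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
```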

Summary of the model

## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |